Abstract. Reinforcement learning (RL) involves sequential decision making
in uncertain environments. The aim of the decision-making agent is to maximize
the benefit of acting in its environment over an extended period of time.
Finding an optimal policy in RL may be very slow. To speed up learning, one
often used solution is the integration of planning, for example, Sutton's Dyna
algorithm, or various other methods using macro-actions.

Here we suggest separating the plannable, i.e., close to deterministic, parts of
the world, and focusing planning efforts on this domain. A novel reinforcement
learning method called plannable RL (pRL) is proposed here. pRL builds a
simple model, which is used to search for macro actions. The simplicity of
the model makes planning computationally inexpensive. It is shown that pRL
finds an optimal policy, and that plannable macro actions found by pRL are
near-optimal. In turn, it is unnecessary to try large numbers of macro actions,
which enables fast learning. The utility of pRL is demonstrated by computer
simulations.



1. Introduction 

1.1. Reinforcement Learning. Reinforcement learning involves sequential
decision making in uncertain environments. The sequential aspect of the decision
problem reflects the fact that the immediate cost or benefit of any state of the
environment may play only a small part in determining the true value of any state.
The aim of the decision-making agent is to develop a policy that maximizes the
benefit of acting in its environment over an extended period of time.

An often used and efficient framework, which describes stochastic, sequential
decision problems, is the Markov decision process (MDP) (for a review,
see [Puterman, 1994]). When a problem description satisfies the requirements
of the MDP framework, well-known algorithms can be used to determine
an optimal policy, such as various forms of the dynamic programming
[Bellman, 1957, Bertsekas, 1987, Sutton, 1991], Q-learning [Watkins, 1989]
or SARSA [Singh and Sutton, 1996] methods.


These algorithms proceed by maintaining a value function, updating it according
to the experiences and propagating the values of states. Under appropriate
conditions, RL algorithms are shown to develop an optimal policy
[Bertsekas, 1987, Singh et al., 2000]. However, the basic forms of these algorithms
propagate the values step-by-step, therefore convergence may be very
slow. Various methods have been developed to overcome this difficulty, such
as prioritized sweeping [Moore and Atkeson, 1993], eligibility traces [Sutton, 1988,
Singh and Sutton, 1996] and also planning methods to be described below.

1.2. Planning. Besides RL, another successful and well-studied approach to solving
decision problems is planning. As a central problem of classical AI research,
effective algorithms have been developed to solve planning problems. However,
classical decision-theoretic methods usually assume that the world is deterministic.
This is an appropriate approximation in some cases, but not always. There may
be certain parts of the world that are highly stochastic, making classical planning
unreliable, or even useless. It is a plausible idea to integrate planning and
reinforcement learning to gain more general and robust methods.

Sutton and colleagues as well as others [Sutton, 1991, Sutton, 1990,
Peng and Williams, 1993, Forbes and Andre, 2000] have integrated planning and
learning in the Dyna architecture: planning is treated as being virtually identical
to reinforcement learning. Learning updates the appropriate value function
estimates according to experience as it actually occurs, whereas planning updates
the same value function estimates for simulated transitions chosen from the world
model.



Other researchers [Sacerdoti, 1977, Korf, 1985] used planning to develop macro
actions (i.e., fixed sequences of actions) that could speed up value propagation
and learning. A macro action could be a complete sub-policy (such as 'go
to the door', 'search a wall', etc.) [Hauskrecht et al., 1998, Precup et al., 1998,
McGovern and Sutton, 1998]. The main difficulty with macro actions is how to
construct them: they must be either handcrafted (see, e.g., [Kaelbling, 1993,
Kalmar et al., 1998]) or one must try to generate them automatically (see, e.g.,
[Dietterich, 2000]). In this latter case a great number of useless macros might
be generated, which might even deteriorate learning [McGovern and Sutton, 1998,
Kalmar and Szepesvari, 1999].



1.3. Thesis of the Paper. We propose a method called plannable RL (pRL)
that has similarities with both planning methods mentioned: we make focused
value updates based on hypothetical experiences gained from a model. This is
similar to the Dyna architecture. On the other hand, we maintain two separate
(but interacting) value functions, one for learning and one for planning. We use a
simple model to evaluate the value function for planning: values are updated when
transitions are considered plannable (i.e., when those are close to deterministic).
This planning-value function is then used to compute macro actions. We show that
these macro actions are near-optimal. This means that we do not need to generate
large numbers of (possibly bad) macro actions, and fast learning is still possible.

1.4. Structure of the Paper. First, the basics of RL methods are reviewed
(Section 2). The pRL algorithm is described and a pseudo-code of the algorithm is
provided in Section 3. The near-optimality of the algorithm is proven in Section 4.
Computational demonstrations are provided in the next section, i.e., in Section 5.
The paper is finished by a discussion (Section 6).



2. Preliminaries 

2.1. Reinforcement Learning (RL) and the MDP framework. Let us recall
the definition of a Markov decision process (MDP) [Puterman, 1994]. A (finite)
MDP is defined by the tuple $(X, A, R, P)$, where $X$ and $A$ denote the finite sets
of states and actions, respectively. $P : X \times A \times X \to [0,1]$ is called the transition
function; $P(x, a, y)$ gives the probability of arriving at state $y$ after executing action
$a$ in state $x$. Finally, $R : X \times A \to \mathbb{R}$ is the reward function; $R(x, a)$ gives the
immediate reward for choosing action $a$ in state $x$.

At each of a sequence of discrete time steps, $t = 0, 1, \ldots$, the problem-solving agent
observes its world to be in state $s_t \in S$ and executes an action $a_t \in A$ (the state
set is written as either $X$ or $S$ below). After this action, the agent receives reward
$r_{t+1} \in \mathbb{R}$ from the world and observes the next state, $s_{t+1}$, according to the
functions $R$ and $P$.

The agent's objective is to choose each action $a_t$ so as to maximize the expected
discounted return $\mathcal{R}_t$, i.e., $E(\mathcal{R}_t) = E(r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots)$, where
$0 < \gamma < 1$ is the discount factor and $E(\cdot)$ denotes the expected value of the argument.

A Markovian policy $\pi$ is an $S \times A \to [0,1]$ mapping, where $\pi(x, a)$ is the probability
that in state $x$ the agent selects action $a$. The value of a state $x$ under policy $\pi$
is the expected value of the total discounted reward of starting from $x$ and then
following policy $\pi$. Formally, $V^\pi(x) = E_{P,\pi}\big(\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = x\big)$ (i.e., the expected
value depends on the transition probabilities and the policy applied). These values
satisfy the recursive equations

$V^\pi(x) = \sum_{a \in A} \pi(x, a) \Big( R(x, a) + \gamma \sum_{y \in S} P(x, a, y) V^\pi(y) \Big)$ for all $x$.

A policy $\pi^*$ is optimal if for all policies $\pi$, $V^{\pi^*}(x) \ge V^\pi(x)$ for all $x \in S$ (it is easy
to show that such a policy exists). The corresponding value function satisfies

(2.1) $V^*(x) = \max_{a \in A} \Big( R(x, a) + \gamma \sum_{y \in S} P(x, a, y) V^*(y) \Big)$.




A standard way to find an optimal policy is to compute the optimal value function
$V^* : X \to \mathbb{R}$, which gives the value (i.e., the expected accumulated discounted
reward) for each starting state. From $V^*$, the optimal policy can be derived: the
'greedy' policy with respect to the optimal value function is an optimal policy (see,
e.g., [Sutton and Barto, 1998] and references therein). This is the well-known value
iteration approach [Bellman, 1957]. Functions $V^\pi$ and $V^*$ are called value functions,
which are associated with a fixed policy $\pi$ and with the optimal policy, respectively.

We can also define the action value function $Q^\pi(x, a)$ as the expected value of
the total discounted reward of starting from state $x$, choosing action $a$ and then
following policy $\pi$. This value function is often more useful, because it provides the
values of individual actions. It is easy to see that

$Q^\pi(x, a) = R(x, a) + \gamma \sum_{y \in S} P(x, a, y) V^\pi(y)$,

$V^\pi(x) = \sum_{a \in A} \pi(x, a) Q^\pi(x, a)$

and thus

$Q^\pi(x, a) = R(x, a) + \gamma \sum_{y \in S} P(x, a, y) \sum_{a' \in A} \pi(y, a') Q^\pi(y, a')$.

The optimal action-value function $Q^*$ satisfies an equation analogous to that of $V^*$:

$Q^*(x, a) = R(x, a) + \gamma \sum_{y \in S} P(x, a, y) \max_{a' \in A} Q^*(y, a')$.

2.2. Dynamic Programming and Two Basic RL Algorithms. Equation (2.1)
can be used as an iteration:

(2.2) $V_{t+1}(x) = \max_{a \in A} \Big( R(x, a) + \gamma \sum_{y \in S} P(x, a, y) V_t(y) \Big)$

for every state $x$, and for an arbitrary $V_0$. The iteration is known to converge to
$V^*$. The method is called value iteration.
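
As an illustration of iteration (2.2), the following is a minimal Python sketch, assuming the model is available as dense NumPy arrays. The function name, the array layout, the stopping tolerance and the toy two-state MDP are assumptions made for this example, not part of the original text.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Iterate Eq. (2.2) until the max-norm change drops below tol.

    P has shape (S, A, S) with P[x, a, y] = P(x, a, y); R has shape (S, A).
    """
    V = np.zeros(P.shape[0])              # arbitrary V_0
    for _ in range(max_iter):
        Q = R + gamma * (P @ V)           # Q[x, a] = R(x,a) + gamma * sum_y P(x,a,y) V(y)
        V_new = Q.max(axis=1)             # maximize over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# Toy MDP (hypothetical numbers): in state 0, action 1 moves to the
# absorbing state 1 and earns reward 1; everything else earns 0.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 0.0]])
V_star = value_iteration(P, R, gamma=0.9)
```

In this toy problem the best the agent can do from state 0 is to collect the single reward, so `V_star` is approximately `[1, 0]`.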

If the model of the environment (i.e., $R$ and $P$) is not known, or the state
space is too large for solving Eq. (2.2), then sampling methods can be used. One
such algorithm is called Q-learning [Watkins, 1989]. Q-learning uses the following
iteration:

(2.3) $Q_{t+1}(s_t, a_t) = (1 - \alpha_t) Q_t(s_t, a_t) + \alpha_t \big( r_t + \gamma \max_a Q_t(s_{t+1}, a) \big)$,

where $s_t$, $a_t$ and $r_t$ are the state of the system, the selected action, and the
immediate reward at time step $t$, respectively, $Q_0$ is arbitrary, and $0 < \alpha_t < 1$ is the
learning rate.

It has been shown that if $\alpha_t$ converges to 0 properly ($\sum_t \alpha_t$ is divergent, but
$\sum_t \alpha_t^2$ is convergent), and if every $(x, a)$ pair is updated infinitely often, then $Q_t$
converges to $Q^*$ with probability 1. The proof can be found, e.g., in [Singh et al., 2000].


Another RL method is SARSA, which takes sampling to the extreme. It has an
update rule similar to (2.3):

(2.4) $Q_{t+1}(s_t, a_t) = (1 - \alpha_t) Q_t(s_t, a_t) + \alpha_t \big( r_t + \gamma Q_t(s_{t+1}, a_{t+1}) \big)$,

where the Q-values of the iteration are action value estimates of two state-action pairs,
the one which just occurred and its predecessor. SARSA is convergent under the
same assumptions as Q-learning [Singh et al., 2000]. A comprehensive overview of
various reinforcement learning methods can be found in [Sutton and Barto, 1998]
and in the references therein.
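
The difference between updates (2.3) and (2.4) is only in the bootstrap term, which the following Python sketch makes explicit. It assumes a tabular Q stored in a dictionary keyed by (state, action); the function names and the numbers in the example are hypothetical.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Eq. (2.3): bootstrap from the best action in the next state."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Eq. (2.4): bootstrap from the action actually selected next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

actions = ["N", "S"]
Q1 = defaultdict(float)
Q1[("y", "N")] = 2.0
Q1[("y", "S")] = 4.0
Q2 = Q1.copy()

# Same transition (x, N, reward 1, y); SARSA is told that N is taken next.
q_learning_update(Q1, "x", "N", 1.0, "y", actions, alpha=0.5, gamma=0.5)
sarsa_update(Q2, "x", "N", 1.0, "y", "N", alpha=0.5, gamma=0.5)
# Q1[("x", "N")] is 1.5 (uses max = 4.0), Q2[("x", "N")] is 1.0 (uses Q(y, N) = 2.0)
```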

2.3. Planning with Dyna. Reinforcement learning often requires a large number
of experiences (i.e., $(s_t, a_t, r_t, s_{t+1})$ tuples) to develop an appropriate policy. This is
efficient when experience can be collected quickly, provided that one can afford the
cost of exploration. Dyna offers a solution to improve the exploitation of previous
experiences by integrating planning into the learning process.

Informally, planning is a process of computing a (near-)optimal policy for the
existing (possibly inaccurate) model of the environment. Planning is an off-line
method: it improves the policy without invoking additional interactions with the
environment. DP, for example, executed on the available model, is a planning
algorithm. A limitation arises if the model is not given: experience is needed to build
a model. Another problem appears when DP is computationally intensive:
performing a single and complete DP iteration requires $O(|S|^2)$ computation steps.
Efforts have been made to overcome this drawback, yielding solutions like
prioritized sweeping [Moore and Atkeson, 1993], or real-time dynamic programming
[Barto et al., 1995].




Another successful approach is to combine reinforcement learning and dynamic 
programming in a single algorithm. This approach was first suggested by Sutton, 
who called the algorithm Dyna. The basic idea is that one DP update on one single 
state can be interpreted as a reinforcement learning step. In the Dyna architecture 
the agent repeats the following steps: (1) obtain experience from the environment,
(2) use this to update the value function, (3) use the experience to improve the 
model of the environment (e.g. to approximate the transition probabilities), (4) 
obtain hypothetical experience from the updated model, and (5) use the hypothetical 
experience to update the value function. Note that Dyna focuses DP iterations on 
previously visited (and thus presumably important) regions of the state space, and 
therefore it might reduce the computational demand of DP significantly. 

Dyna allows the adjustment of planning relative to collecting experiences: if 
there is no time for planning, steps (4) and (5) can be omitted. Conversely, if real 
experience is slow or costly, then multiple planning steps may be accomplished. 
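
The five Dyna steps can be sketched in Python as follows. This is only an illustrative Dyna-Q style loop under simplifying assumptions: the model is a table of the last observed outcome per state-action pair (adequate only for deterministic transitions), the value update is Q-learning, and the corridor environment `chain_step` is hypothetical.

```python
import random
from collections import defaultdict

def dyna_q(env_step, start, goal, actions, episodes=30, planning_steps=10,
           alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}  # (state, action) -> (reward, next_state), last observed outcome

    def update(s, a, r, s2):
        target = r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s = start
        while s != goal:
            best = max(Q[(s, b)] for b in actions)
            greedy = [b for b in actions if Q[(s, b)] == best]
            a = rng.choice(actions) if rng.random() < eps else rng.choice(greedy)
            r, s2 = env_step(s, a)              # (1) real experience
            update(s, a, r, s2)                 # (2) value update
            model[(s, a)] = (r, s2)             # (3) model update
            for _ in range(planning_steps):     # (4)-(5) hypothetical experience
                (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
                update(ps, pa, pr, ps2)
            s = s2
    return Q

def chain_step(s, a):
    """Hypothetical 5-state corridor: states 0..4, reward 1 on reaching state 4."""
    s2 = min(s + 1, 4) if a == "R" else max(s - 1, 0)
    return (1.0 if s2 == 4 else 0.0), s2

Q = dyna_q(chain_step, start=0, goal=4, actions=["L", "R"])
```

Setting `planning_steps=0` omits steps (4) and (5), while a larger value trades computation for fewer real interactions.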

2.4. Planning with Macro Actions. Another possibility for utilizing additional 
computational capacity is to compile compound actions (macro actions) from basic 
ones. In its simplest form, macros are fixed sequences of actions (e.g. 'go forward, 
turn right, go forward'). Many works have dealt with different versions of the macro 
concept. In particular, one might consider policies of sub-problems (e.g., policy for 
'finding a wall' or policy for 'going to the door') as macros. In this case separate 
value functions on separate parts of the state-space will arise. 

In order to integrate macros into reinforcement learning, two additional steps
must be implemented: (6) the generation of macro actions and (7) the evaluation
of macro actions. The latter can be accomplished, e.g., by computing
$Q(x, a_{\text{macro}})$ in all necessary states. Generating appropriate macro actions is a
more difficult problem. One commonly used approach is to pre-wire them by hand
[Kaelbling, 1993]. This is a straightforward way to encode prior knowledge into
the learning problem if such prior knowledge is available and if the environment is
steady. Efforts have also been made to generate macro actions in an automated
fashion. The general approach is to use some heuristics to compile a number of
macros from basic actions, evaluate the macros, and then keep the ones which are
the most useful.

A useful set of macro actions can substantially speed up learning, because the 
agent can make larger steps toward its goal. On the other hand, evaluating bad 
macros can deteriorate performance (because trying a bad macro takes a large step 
in a wrong direction). 



3. The pRL algorithm 

In this section a novel algorithm is proposed, which combines DP and macros. 
First, a model is built for model-based value- function updates, just like in Dyna. 
These updates are used directly to find useful macro actions and to calculate their 
utility. A parallel approximate value function is maintained, which encodes the 
macros and their values according to the model. Greedy policy with respect to this 
approximate value function is applied. The resulting policy is then used to choose 
a macro action. 

An important question is the choice of the model. One might attempt to learn
an approximation of the $P(x, a, y)$ transition probabilities and the corresponding
rewards. However, estimating and maintaining a table of size $|S|^2 \cdot |A|$ may not
be feasible. Moreover, we think that planning is more effective in near-deterministic
domains of the state space, and is less advantageous when little is known about the
outcome of an action. Such problems frequently arise in real life. For example,

• When one has to pass a city, one might choose going through the city directly
or taking the ring road around the city. The first option could be less expensive
in many cases, but long travel times may arise if a traffic jam
occurs in the inner areas. The second option may allow for more accurate
planning and may have a competitive cost on average.

• Requests can be distributed among mobile agents in many different ways.
Planning becomes crucial when part of the system may be (or become) unreliable.

Our planning algorithm (the planner) stores only those transitions that are
almost deterministic, which we shall call plannable transitions.^1 The algorithm will
be called plannable-RL, or pRL, for short.

3.1. Basic Definitions, the Value Functions and their Update Rules. In the
following we specify the pRL algorithm. First of all, we define plannable transitions
formally:

Definition 3.1. A transition $(x, y)$ is called plannable with accuracy $\kappa$ if it
can be realized with probability $\kappa$, i.e., there exists an action $a(x, y)$ such that
$P(x, a(x, y), y) > \kappa$.

Let us denote the set of states that are plannable from state $x$ by $T(x)$, i.e.,
$T(x) := \{y : (x, y) \text{ is } \kappa\text{-plannable}\} = \{y \mid \exists a(x, y) : P(x, a(x, y), y) > \kappa\}$.

We assume that we are given an inverse dynamics $\phi : X \times X \to A$ such that
$\phi(x, y) = a(x, y)$ if $(x, y)$ is plannable, and it is arbitrary otherwise. This is a
reasonable assumption: either the inverse dynamics is known in advance, or the
'learning by doing' method can be sufficient to approximate the inverse dynamics
for the plannable domains.

To compute $T(\cdot)$, we approximate $P(x, \phi(x, y), y)$ with $\hat{P}_t(x, y)$ using the following
approximation scheme:

(3.1) $\hat{P}_{t+1}(x, y) = \begin{cases} (1 - \alpha_t) \hat{P}_t(x, y) + \alpha_t \cdot 1 & \text{if } s_t = x,\ a_t = \phi(x, y) \text{ and } s_{t+1} = y, \\ (1 - \alpha_t) \hat{P}_t(x, y) + \alpha_t \cdot 0 & \text{if } s_t = x,\ a_t = \phi(x, y) \text{ and } s_{t+1} \ne y, \\ \hat{P}_t(x, y) & \text{otherwise,} \end{cases}$

$\hat{P}_0 = 1$.

Note that this iteration approximates the exact transition probabilities. A transition
$(x, y)$ is considered plannable iff $\hat{P}_t(x, y) > \kappa$. These almost-sure transitions
will be considered as sure by our planning algorithm. The approximation simplifies
the required computations, and, as we shall show later, provides near-optimal
solutions.
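
A possible tabular realization of scheme (3.1) is sketched below in Python. The class name, the constant learning rate and the `candidates` argument (the target states $y$ that are tracked for the visited state) are assumptions of this sketch; the source does not specify how the set of tracked pairs is chosen.

```python
class PlannabilityModel:
    """Running estimate P_hat(x, y) of P(x, phi(x, y), y), following Eq. (3.1)."""

    def __init__(self, phi, alpha=0.1, kappa=0.8):
        self.phi = phi        # inverse dynamics: phi(x, y) -> action
        self.alpha = alpha    # constant learning rate (an assumption)
        self.kappa = kappa    # plannability threshold
        self.p_hat = {}       # (x, y) -> estimate; unseen pairs count as 1 (P_0 = 1)

    def update(self, s, a, s_next, candidates):
        for y in candidates:
            if self.phi(s, y) != a:
                continue      # "otherwise" branch of (3.1): estimate unchanged
            hit = 1.0 if y == s_next else 0.0
            old = self.p_hat.get((s, y), 1.0)
            self.p_hat[(s, y)] = (1 - self.alpha) * old + self.alpha * hit

    def plannable(self, x, y):
        return self.p_hat.get((x, y), 1.0) > self.kappa

# Hypothetical corridor: moving to a larger state index needs action "R".
phi = lambda x, y: "R" if y > x else "L"
m = PlannabilityModel(phi, alpha=0.5, kappa=0.8)
m.update(0, "R", 1, candidates=[1])   # success: estimate stays at 1.0
m.update(0, "R", 0, candidates=[1])   # failure: estimate drops to 0.5
```

Because the estimates start at 1, every transition is optimistically treated as plannable until failures are observed.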

Definition 3.2. Connected components of plannable transitions are called 
plannable domains. 

Note that plannable domains are disjoint. In turn, the number of plannable 
domains is smaller than the number of states. 

The immediate rewards of plannable transitions are similarly approximated by
$\hat{R}(x, y)$.

^1 Note that we allow for almost deterministic transitions, but we do not assume an almost
deterministic environment here.
In our algorithm, two value functions are maintained: a standard (basic)
action-value function $Q(x, a)$, $a \in A$, which is updated by real experiences, and another
value function called the planning value function. The planning value function
makes use of a Dyna-like algorithm on top of the simplified model described in Eq. (3.1).
Both functions suggest (possibly different) policies. In every time step, pRL can
switch between these policies by examining which one seems better.

3.2. Updating the basic value function. For updating the basic value function,
any traditional RL update rule can be applied. For example, one may use DP,
Q-learning, SARSA, etc. We use SARSA because of its simplicity:

$Q_{t+1}(s_t, a_t) = (1 - \alpha_t) Q_t(s_t, a_t) + \alpha_t \big( r_t + \gamma Q_t(s_{t+1}, a_{t+1}) \big)$,

where $s_t$, $a_t$ and $r_t$ are the state of the system, the selected action and the reward at
time step $t$, respectively. $Q_0$ is arbitrary, and $0 < \alpha_t < 1$ denotes the learning rate.
Note that the basic value of state $x$ at time $t$ is given by $V_t(x) = \max_a Q_t(x, a)$.

3.3. Updating the planning value function. Several update rules could be
used for the planning value function. However, there is an important difference: an
(approximate) model and an inverse dynamics are known in this case. Therefore we
do not need to maintain an action-value table; a simple state value function
can be used instead. This function will be denoted by $\hat{V}(x)$. The corresponding policy can
be determined, e.g., by the inverse dynamics. Note that this policy may differ from
the policy suggested by the basic value function.

We chose the value iteration update rule in the following form:

$\hat{V}_{t+1}(x) = \max_a \sum_y P(x, a, y) \big( R(x, a, y) + \gamma \hat{V}_t(y) \big)$.

The number of non-vanishing terms of this update may be considerably smaller
than the number of all possible terms, because in our approximate model all
transition probabilities are either 0 or 1. Transitions with low probability are omitted,
whereas transitions with high probability (determined by $\kappa$) are considered as sure
transitions. Note that this simplification could mislead action selection. The error
depends on the degree of the simplification, which is determined by $\kappa$: lower $\kappa$
results in a coarser model. Assuming that the immediate reward $R$ is approximated by
$\hat{R}_t$, the update rule can be rewritten as

(3.2) $\hat{V}_{t+1}(x) = \max \Big( \max_{y \in T(x)} \big( \hat{R}_t(x, y) + \gamma \hat{V}_t(y) \big),\ V_t(x) \Big)$.

Here we took into consideration that sometimes it is better to select actions
according to the basic value function $V_t(x)$ (i.e., continue sampling or quit planning
and start sampling) than to select actions according to the planning value function
$\hat{V}_t(x)$ (i.e., continue planning, or quit sampling and start planning). Note that
when no plannable states are available, one has to return to the basic value function.
One may think of this technique as extending the space of plannable actions
by a new pseudo-action, which could be called 'stop planning': the choice of this
action retreats to sampling and to action selection by means of the basic value
function.

The update rule is applied to the planning value function in the plannable area
around the current state $x$ in every time step. This area can be determined by
a limited breadth-first search on the graph of plannable transitions, starting from
state $x$. The search phase and the update phase require no interactions with the
real system; these are 'off-line' evaluations. Due to the limited search and the
limited number of DP updates, they take only $O(C)$ steps.
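
The search and update phases can be sketched in Python as follows. The graph representation (a dictionary from a state to its plannable successors), the function names and the fallback to the basic value for states whose planning value has not been set yet are assumptions of this sketch.

```python
import random
from collections import deque

def plannable_area(G, start, max_nodes):
    """Limited breadth-first search on the plannable-transition graph G."""
    seen, queue = [start], deque([start])
    while queue and len(seen) < max_nodes:
        x = queue.popleft()
        for y in G.get(x, []):
            if y not in seen and len(seen) < max_nodes:
                seen.append(y)
                queue.append(y)
    return seen

def plan_updates(G, R_hat, V, V_plan, area, gamma, n_updates, rng):
    """Apply update (3.2) to randomly chosen states of the plannable area."""
    for _ in range(n_updates):
        x = rng.choice(area)
        best = max((R_hat[(x, y)] + gamma * V_plan.get(y, V.get(y, 0.0))
                    for y in G.get(x, [])),
                   default=float("-inf"))   # no plannable successor
        V_plan[x] = max(best, V[x])         # 'stop planning' keeps the basic value

# Hypothetical chain 0 -> 1 -> 2 with a reward on the last transition.
G = {0: [1], 1: [2]}
R_hat = {(0, 1): 0.0, (1, 2): 10.0}
V = {0: 0.0, 1: 0.0, 2: 0.0}               # basic values, V(x) = max_a Q(x, a)
V_plan = {}
area = plannable_area(G, start=0, max_nodes=10)
plan_updates(G, R_hat, V, V_plan, area, gamma=0.5, n_updates=50,
             rng=random.Random(0))
```

Both phases touch a bounded number of states, so the cost per time step does not grow with the size of the state space.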

3.4. Action selection. In classical RL methods (Q-learning, SARSA, etc.), ε-soft
or ε-greedy policies with respect to the actual value function can serve to support
exploration and exploitation (see, e.g., [Sutton and Barto, 1998]). Now we have
two different value functions which may suggest different policies. We shall make
use of the ε-greedy policy when $Q_t$ is used, but if the value of planning from the
actual state is higher, i.e., $\hat{V}_t(x) > V_t(x)$, then the action will be generated by means of
the planning value function $\hat{V}_t$. This decision is made by pRL in every state.
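
The switching rule can be written down compactly. The Python sketch below assumes dictionary-based tables and hypothetical names; it follows the plan when the planning value exceeds the basic value and falls back to ε-greedy selection on the basic action values otherwise.

```python
import random

def select_action(s, Q, V_plan, G, R_hat, phi, actions, gamma, eps, rng):
    v_basic = max(Q[(s, a)] for a in actions)        # V_t(s) = max_a Q_t(s, a)
    successors = G.get(s, [])                        # plannable next states
    if successors and V_plan.get(s, v_basic) > v_basic:
        y = max(successors,
                key=lambda z: R_hat[(s, z)] + gamma * V_plan.get(z, 0.0))
        return phi(s, y)                             # follow the plan
    if rng.random() < eps:
        return rng.choice(actions)                   # explore
    return max(actions, key=lambda a: Q[(s, a)])     # exploit the basic values

# Hypothetical corridor example with eps = 0 to keep the outcome deterministic.
phi = lambda x, y: "R" if y > x else "L"
Q = {(0, "L"): 0.0, (0, "R"): 0.0}
a_plan = select_action(0, Q, {0: 5.0, 1: 10.0}, {0: [1]}, {(0, 1): 0.0}, phi,
                       ["L", "R"], gamma=0.5, eps=0.0, rng=random.Random(0))
a_basic = select_action(0, Q, {0: 0.0}, {0: [1]}, {(0, 1): 0.0}, phi,
                        ["L", "R"], gamma=0.5, eps=0.0, rng=random.Random(0))
# a_plan is "R" (planning value 5 > basic value 0); a_basic is "L" (greedy on Q, first of the tie)
```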

3.5. Plannable transitions as macros. Macros are not represented explicitly in 
pRL, only through the planning value function. This is advantageous because we 
do not need much space to store the macros - and we can still store the most useful 
macro with its value for each state. 

The macro encoded by $\hat{V}$ is "choose the greedy action according to $\hat{V}_t$, while
$\hat{V}_t > V_t$. Stop the macro if an action leads to a state other than the planned one."

More formally, let

$S(x) := \arg\max_y \big( \hat{R}_t(x, y) + \gamma \hat{V}(y) \big)$

be the greedy selection according to $\hat{V}$, and let

$a(x) := \phi(x, S(x))$.

Then the macro of state $x$ is the action sequence
$a(x), a(S(x)), a(S(S(x))), \ldots, a(S^n(x)), \ldots$, with the stopping condition "stop
at the $n$th step if either $\hat{V}_t(S^{n+1}(x)) \le V_t(S^{n+1}(x))$, or taking action $a(S^n(x))$ in
state $S^n(x)$ leads to a state other than $S^{n+1}(x)$."

Naturally, such listing of the macros can be included into the pRL algorithm, 
but is not necessary. 
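
For illustration, such a listing could look like the Python sketch below. It only unrolls the planned outcomes, so the second stopping condition (a real transition deviating from the plan) cannot be checked here; the function name, the `max_len` guard and the example tables are assumptions.

```python
def extract_macro(x, G, R_hat, V, V_plan, phi, gamma, max_len=100):
    """List the actions of the macro that starts in state x."""
    macro = []
    while len(macro) < max_len and G.get(x):
        # S(x): greedy successor with respect to the planning values
        nxt = max(G[x], key=lambda y: R_hat[(x, y)] + gamma * V_plan[y])
        macro.append(phi(x, nxt))                    # a(x) = phi(x, S(x))
        if V_plan[nxt] <= V[nxt]:
            break   # planning no longer beats the basic value: stop the macro
        x = nxt
    return macro

# Hypothetical chain 0 -> 1 -> 2 with converged planning values.
phi = lambda x, y: "R" if y > x else "L"
G = {0: [1], 1: [2]}
R_hat = {(0, 1): 0.0, (1, 2): 10.0}
V = {0: 0.0, 1: 0.0, 2: 0.0}
V_plan = {0: 5.0, 1: 10.0, 2: 0.0}
macro = extract_macro(0, G, R_hat, V, V_plan, phi, gamma=0.5)
# macro is ["R", "R"]
```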

The pseudo-code of the algorithm is summarized in Fig. 1.

4. Near-optimality of pRL Macros 

It is easy to see that the basic value function of pRL converges to the optimal
value function. This follows from the fact that the convergence theorem for SARSA
and Q-learning [Singh et al., 2000] requires (i) sufficient exploration (every $(x, a)$
pair should be visited infinitely often) and (ii) setting the learning rate properly
($\sum_t \alpha_t = \infty$, but $\sum_t \alpha_t^2 < \infty$). Clearly, if these criteria are satisfied with standard
RL algorithms, then they are also satisfied in pRL. However, convergence to the
optimal policy is guaranteed only if macros are valued correctly, i.e., if $\kappa = 1$.
The rate of convergence can also be seriously affected by $\kappa$: $\kappa < 1$ allows for
a larger set of macros and may converge much faster to (possibly suboptimal)
solutions. Therefore, it is an important issue how the learning and the utilization
of macros will influence performance. For the analysis of this problem, an extension
of the classical MDP model is necessary. For this reason, we briefly introduce
ε-MDPs [Szita et al., 2002b], and review related theorems. The ε-MDP theory will
be applied to pRL to show that macros are near-optimal.
Initialize $Q$, $\hat{V}$, $\hat{R}$, $\hat{P}$, $s_0$, $a_0$, $r_0$, $\phi$
$t \leftarrow 0$
REPEAT FOREVER
    Observe $s_t$, $r_t$
    Update $\hat{P}$ using (3.1)
    Update $\hat{R}$
    SARSA update:
        $Q(s_{t-1}, a_{t-1}) \leftarrow (1 - \alpha_t) Q(s_{t-1}, a_{t-1}) + \alpha_t (r_t + \gamma Q(s_t, a_t))$

    % determine plannable area
    $G := \{(x, y) \in S \times S : \hat{P}(x, y) > \kappa\}$
    PLAN_AREA $\leftarrow$ breadth-first search in $G$ from $s_t$,
        visiting at most $N$ nodes

    % update plan-value function
    LOOP $M$ times
        $x \leftarrow$ select a state from PLAN_AREA randomly
        $V(x) \leftarrow \max_a Q(x, a)$
        $\hat{V}(x) \leftarrow \max\big( \max_{y : (x, y) \in G} (\hat{R}(x, y) + \gamma \hat{V}(y)),\ V(x) \big)$
    END LOOP

    % action selection
    IF $\hat{V}(x) = V(x)$ OR
       plan length > allowed plan length OR
       no plannable states
    THEN
        $a_t \leftarrow \arg\max_{a \in A} Q(s_t, a)$
    ELSE
        $y \leftarrow \arg\max_{y : (x, y) \in G} (\hat{R}(x, y) + \gamma \hat{V}(y))$
        $a_t \leftarrow \phi(s_t, y)$
    END IF
    $t \leftarrow t + 1$
END REPEAT

Figure 1. Pseudocode of the pRL algorithm. In this form, the
algorithm takes at most $O(N + M)$ steps for every action selection.



4.1. ε-MDPs. The RL problem can be extended so that the environment is no
longer required to be an MDP; it is only required to remain 'near' to an MDP, i.e.,
the environment is allowed to change over time, even in a non-Markovian manner.

The closeness of two environments (which have the same state and action sets) is
measured by the distance of their transition functions. We say that the distance of
two transition functions $P$ and $P'$ is ε-small ($\varepsilon > 0$) if $\|P(x, a, \cdot) - P'(x, a, \cdot)\|_{L_1} < \varepsilon$
for all $(x, a)$, i.e., $\sum_y |P(x, a, y) - P'(x, a, y)| < \varepsilon$ for all $(x, a)$. (Note that for a
given state $x$ and action $a$, $P(x, a, y)$ is a probability distribution over $y \in X$.)

A tuple $(X, A, \{P_t\}, R)$ is an ε-MDP with $\varepsilon > 0$ if there exists an MDP
$(X, A, P, R)$ (called the base MDP) such that the difference of the transition
functions $P$ and $P_t$ is ε-small for all $t = 1, 2, 3, \ldots$
As a simple example for an ε-MDP, consider an ordinary MDP perturbed by a
small noise in each time step.
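
The ε-smallness condition is straightforward to check numerically. The following Python sketch assumes that both transition functions are stored as arrays of shape (states, actions, states); the function names and the numbers are hypothetical.

```python
import numpy as np

def transition_distance(P, P_prime):
    """Largest L1 distance between P(x, a, .) and P'(x, a, .) over all (x, a)."""
    return float(np.abs(P - P_prime).sum(axis=2).max())

def is_eps_small(P, P_prime, eps):
    return transition_distance(P, P_prime) < eps

# One state-action pair whose outcome distribution is slightly perturbed.
P = np.array([[[0.9, 0.1]]])
P_perturbed = np.array([[[1.0, 0.0]]])
d = transition_distance(P, P_perturbed)   # approximately 0.2
```

A sequence of environments $\{P_t\}$ forms an ε-MDP around the base `P` if this check passes for every time step.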



4.2. Convergence of Learning Algorithms in ε-MDPs. One expects that such
small perturbations in the environment may not disturb the performance of the
learning algorithms very much. Nevertheless, one cannot expect that any algorithm
finds an optimal value function in an ε-MDP. (Such a solution may not even exist
because of the perturbations of the environment.) However, we shall guarantee that
these algorithms find near-optimal value functions. Formally, the task is to show
that $\limsup_{t \to \infty} \|V_t - V^*\| < \text{const} \cdot \varepsilon$ (or $\limsup_{t \to \infty} \|Q_t - Q^*\| < \text{const} \cdot \varepsilon$),^2 where
$V^*$ and $Q^*$ are the optimal value functions of the base MDP.

First, consider Q-learning. Recall the update:

(4.1) $Q_{t+1}(x_t, a_t) = (1 - \alpha_t(x_t, a_t)) Q_t(x_t, a_t) + \alpha_t(x_t, a_t) \big( r_t + \gamma \max_a Q_t(y_t, a) \big)$,

where $y_t$ is selected by sampling, i.e., according to the probability distribution
$P_t(x_t, a_t, \cdot)$.

Theorem 4.1. Let $Q^*$ be the optimal value function of the base MDP of the
ε-MDP, and let $M = \max_{x,a} Q^*(x, a) - \min_{x,a} Q^*(x, a)$. If

(1) every state-action pair $(x, a)$ is updated infinitely often,

(2) the learning rates satisfy $\sum_{t=0}^{\infty} \chi(x_t = x, a_t = a)\, \alpha_t(x, a) = \infty$ and
$\sum_{t=0}^{\infty} \chi(x_t = x, a_t = a)\, \alpha_t(x, a)^2 < \infty$ uniformly w.p.1,^3

then $\limsup_{t \to \infty} \|Q_t - Q^*\| < \text{const} \cdot M \varepsilon$ w.p.1.



The proof can be found in [Szita et al., 2002a].

The case of SARSA. The proof of the near-optimality of the SARSA update
is similar to the proof for Q-learning. Here it suffices to refer to [Singh et al., 2000].
It has been proven for MDPs in [Singh et al., 2000] that if Q-learning converges
to the optimal value function, then, under the same conditions, SARSA converges,
too. The proof in [Singh et al., 2000] carries over directly to ε-MDPs
[Szita et al., 2002a].



4.3. pRL in the ε-MDP Framework. Recall that for finding macro actions, we
have applied learning methods that find optimal policies and value functions in
the modified environment, where almost sure transitions (with probabilities greater
than $\kappa$) are treated as sure (probability 1). It may be worth noting that according
to Eq. (3.2) the value function of the model (i.e., $\hat{V}$) is inherited from the original
value function. $\hat{V}(x)$ can be different from $V(x)$ iff $P(x, y)$ is modified by the model.
Note that the modification of $P(x, y)$ is at most $\varepsilon$. Therefore the modified environment
is an ε-MDP of the original one, with $\varepsilon = 1 - \kappa$, so one can apply the ε-MDP theory
to prove the following:

Corollary 4.2. The pRL algorithm that uses either DP, Q-learning or SARSA
for the learning of the macro-value function produces approximations $\hat{V}_t$ such that
$\limsup_{t \to \infty} \|\hat{V}_t - V^*\| < \text{const} \cdot (1 - \kappa)$.

Consequently, the macro actions found by pRL are asymptotically near-optimal.
Figure 2. (a) Immediate rewards of the example labyrinth. The colors mark the
immediate reward gained when the agent arrives in the plotted states. Reaching
the goal state means an immediate +200 reward (this one is not depicted in the
picture). (b) Probabilities in the example labyrinth. The colors mark the
probability with which an action leads to the 'good' state from the plotted states,
e.g., if the agent selects the action 'NORTH' in a given state, it really arrives in
the state lying northward. If the transition is not successful in this sense, the
agent arrives in a 'bad' state chosen with equal probability from the other
possible next states. (c) Plannable areas at different $\kappa$.



5. Experiments 

To test pRL, a 40 X 40 maze was generated. The agent starts in the upper left 
corner and its goal is in the lower right corner. In every state, the agent observes 
its position and can take four different actions (N(=north),S(=south),E(=east), 
and W(=west)). In every state s, an action is successful with probability P^^'^^{s) 

^Unless otherwise noted, ||.|| denotes the max-norm. 

■^X denotes the indicator function, xi'^ondition) is 1 if the condition is true, and otherwise. 
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Figure 3. Convergence Speed of pRL 

Learning curves for three different kappa values. Dotted line: 
K = 0.05, solid bold line: k = 0.15, dashed bold line (SARSA): 
K = 1.0. The curves represent averaged step numbers using av- 
eraging over 500 steps. Planning converges much faster then the 
original SARSA in the early phase. Convergence, however, contin- 
ues beyond 2000 trials. 



(the attempted direction is executed), and it fails with probability
1 − P^succ(s), in which case the agent steps in a random wrong direction. A lower bound
of 0.7 was set for P^succ(s). Regions with higher P^succ(s) were generated, and the
value of P^succ(s) was gradually increased to 1 within these regions. The
agent received a small negative reward (−0.1) in each state, except for the goal
state, where it received a reward of +200. In addition, some pitfall domains were
generated randomly. Each domain contributed −1 to the local reward. Where
domains overlapped, their rewards accumulated. After reaching the goal,
a new episode was started. An example problem is shown in Fig. 2.
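
The environment dynamics described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulator used in the experiments: the names `step`, `reward` and `p_succ` are hypothetical, moves that would leave the grid are assumed to keep the agent at the border, interior walls are omitted, and each pitfall domain is assumed to be a set of cells whose −1 penalty is added to the −0.1 step cost.

```python
import random

SIZE = 40  # the maze is 40 x 40
ACTIONS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
GOAL = (SIZE - 1, SIZE - 1)  # lower right corner


def step(state, action, p_succ, rng=random):
    """Move in the attempted direction with probability p_succ,
    otherwise in one of the other directions chosen uniformly."""
    if rng.random() < p_succ:
        direction = action
    else:
        direction = rng.choice([a for a in ACTIONS if a != action])
    d_row, d_col = ACTIONS[direction]
    # Assumption: the agent stays inside the grid (clamped at the border).
    row = min(max(state[0] + d_row, 0), SIZE - 1)
    col = min(max(state[1] + d_col, 0), SIZE - 1)
    return (row, col)


def reward(state, pitfalls=()):
    """+200 at the goal; otherwise -0.1 plus -1 per pitfall domain
    containing the state (assumed to be additive)."""
    if state == GOAL:
        return 200.0
    return -0.1 - sum(1.0 for domain in pitfalls if state in domain)
```

With `p_succ = 1.0` the move is deterministic, which corresponds to the interior of the regions where the success probability was raised to 1.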

In the experiments, SARSA was used with ε-greedy action selection and
eligibility traces ([Sutton, 1988, Singh and Sutton, 1996]). The following parameters
were used: the learning rate was held constant at α = 0.001, the eligibility decay
λ was set to 0.95, the discount factor γ was 0.98, and the probability of random
action selection was 0.1. For pRL, the same settings were used and several κ values
were tried. The number of updates was set to 10.
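
The learning rule with these parameter values can be sketched as follows. This is a tabular illustration only, not the authors' implementation: the function names are hypothetical, and the use of replacing traces is an assumption, since the text does not say which trace variant was used.

```python
import random
from collections import defaultdict

ALPHA = 0.001    # learning rate
LAMBDA = 0.95    # eligibility decay
GAMMA = 0.98     # discount factor
EPSILON = 0.1    # probability of random action selection
ACTIONS = ("N", "S", "E", "W")


def epsilon_greedy(q, state, rng=random):
    """Pick a random action with probability EPSILON, else a greedy one."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(state, a)])


def sarsa_lambda_update(q, traces, s, a, r, s_next, a_next, terminal=False):
    """One SARSA(lambda) step on the tables q and traces (both defaultdicts)."""
    target = r if terminal else r + GAMMA * q[(s_next, a_next)]
    delta = target - q[(s, a)]
    traces[(s, a)] = 1.0  # replacing trace (assumption)
    for key in list(traces):
        q[key] += ALPHA * delta * traces[key]
        traces[key] *= GAMMA * LAMBDA  # decay every trace
    return delta


q = defaultdict(float)
traces = defaultdict(float)
sarsa_lambda_update(q, traces, (0, 0), "E", -0.1, (0, 1), "E")
```

At the start of each episode the `traces` table would be reset to empty while `q` is kept.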

5.1. Convergence speed. In theory, by setting κ = 1, pRL becomes identical
to SARSA, and by setting κ = 0, pRL becomes equivalent to the simplest Dyna
algorithm (see B.l). With an appropriate κ, pRL learns at least as quickly
as the better candidate; thus it approximates the convergence rate of Dyna, which
usually converges in a few trials in this case, even at low κ values (Fig. 3). Problems
may arise if it is not known whether the best solution corresponds to total planning
(Dyna) or to no planning at all (SARSA). By choosing an appropriate κ
Figure 4. Optimality of pRL

Final performance of the resulting policy as a function of κ.
After training had converged, the performance of the algorithm
was measured by averaging, over 10000 trials, the number of steps
required to finish one trial. The horizontal line depicts the optimal
solution found by SARSA (κ = 1).



(or a set of κ values), the convergence rate of pRL can always reach these bounds.
The method is most useful if computations for several κ values can be afforded.

5.2. Optimality. The second experiment demonstrates the near-optimality of
pRL. The performance of the resulting policy was computed for different κ values
(Fig. 4). The question is whether the constant multiplier of Corollary 4.2 is too
high. Here, the assumption of deterministic transitions
does not significantly influence the performance for high κ values, as expected. It
was also found that the resulting policies perform quite well for some fairly low κ
values; for example, κ ≈ 0.5 still performed reasonably well. This suggests that
better (stronger) estimates might exist for this problem. However, there is a κ
region, around κ = 0.7, that exhibits poor performance. The poor
performance is a consequence of the special properties of our toy problem. In this
κ region, learning is compromised by the fact that most transition probabilities had
a value of 0.7 (except in the plannable domains), and thus the algorithm was highly
uncertain whether a particular transition was plannable or not.
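
The instability near κ = 0.7 can be illustrated with a small count-based model. This is a sketch under assumptions, not the model used by pRL: the class `TransitionModel` and its methods are hypothetical, and the test "estimated probability ≥ κ" is an assumed reading of plannability.

```python
from collections import Counter, defaultdict


class TransitionModel:
    """Empirical transition probabilities estimated from visit counts."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def observe(self, s, a, s_next, times=1):
        self.counts[(s, a)][s_next] += times

    def probability(self, s, a, s_next):
        total = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)][s_next] / total if total else 0.0

    def is_plannable(self, s, a, s_next, kappa):
        # Assumed criterion: the estimate reaches the threshold kappa.
        return self.probability(s, a, s_next) >= kappa


# Two samples of 100 trials from a transition whose true success
# probability is 0.7: one sample has 69 successes, the other 71.
low, high = TransitionModel(), TransitionModel()
low.observe("s", "N", "north", times=69)
low.observe("s", "N", "east", times=31)
high.observe("s", "N", "north", times=71)
high.observe("s", "N", "east", times=29)

print(low.is_plannable("s", "N", "north", 0.7))   # False
print(high.is_plannable("s", "N", "north", 0.7))  # True
print(low.is_plannable("s", "N", "north", 0.5))   # True
print(high.is_plannable("s", "N", "north", 0.5))  # True
```

Both samples are consistent with a true probability of 0.7, yet they fall on opposite sides of the threshold at κ = 0.7, whereas at κ = 0.5 the classification is stable.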

6. Discussion 



In this work, we introduced a new algorithm called pRL, which integrates plan- 
ning into reinforcement learning in a novel way. An attractive property of our 
             DP is inexpensive,       DP is expensive,           DP is expensive,
             new data is expensive    new data is inexpensive    new data is expensive

SARSA        slow, optimal            fast, optimal              slow, optimal
Dyna         fast, optimal            slow, optimal              slow, optimal
simple Dyna  fast, not optimal        slow, not optimal          slow, not optimal
pRL          fast, near-optimal       fast, near-optimal         fast, near-optimal
             (low κ)                  (high κ)


Table 1. Comparison of RL methods for various problem types.
'DP is inexpensive' means that DP iterations have low computational cost,
e.g., because the state space is relatively small. 'New data is inexpensive'
means that the agent can easily obtain new experience by interacting
with the environment. 'Optimal' means that the method
eventually converges to an optimal policy.



algorithm is that it can find a near-optimal value function and a near-optimal policy
under certain conditions. Furthermore, the macros created by pRL are near-optimal.
Near-optimality is controlled by the certainty of planning: our algorithm deals with
κ-plannable transitions, which have transition probabilities of at least κ, where 0 < κ < 1.

Our algorithm was illustrated by computer simulations on a toy problem. It
was found that SARSA is slow, but it leads to optimal solutions. On the other
hand, Dyna may be fast but the resulting policies could be poor. Our algorithm
incorporates both SARSA and Dyna as its two extremes: SARSA corresponds
to κ = 1, whereas the original form of Dyna [Sutton, 1991] is recovered when
κ = 0. By tuning κ in pRL and by using the resulting computationally inexpensive
model of pRL, one can quickly find almost optimal solutions. This feature is most
advantageous when off-line computations are inexpensive. Comparisons amongst
the different methods are provided in Table 1.

pRL fits well into existing planning paradigms: it can be seen as a prioritized
sweeping method where only the plannable states (i.e. states which can
be reached almost deterministically from the current state) are updated. In
turn, our method is a particular (extended) form of prioritized sweeping (see, e.g.
[Moore and Atkeson, 1993]). One can decide to select updates according to state
values [Moore and Atkeson, 1993], or to select updates ordered by increasing distance
from the current state or by prediction difference [Peng and Williams, 1993].
Prioritized sweeping can dramatically improve the performance of planning. Using our
method, one can also obtain performance bounds.
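
One of the orderings mentioned above, updates by increasing distance from the current state, can be sketched as a breadth-first traversal restricted to plannable transitions. This is only an illustration of that ordering, not the update schedule used by pRL: the function `plannable_sweep_order`, the callback `plannable_successors` and the `max_updates` budget are all hypothetical.

```python
from collections import deque


def plannable_sweep_order(start, plannable_successors, max_updates=10):
    """Return up to max_updates states, ordered by increasing distance
    from start, following only plannable transitions."""
    order = []
    seen = {start}
    queue = deque([start])
    while queue and len(order) < max_updates:
        state = queue.popleft()
        order.append(state)
        for nxt in plannable_successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


# Toy graph of plannable transitions (hypothetical).
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(plannable_sweep_order("A", lambda s: graph[s]))  # ['A', 'B', 'C', 'D']
```

States that cannot be reached through plannable transitions never enter the queue, so the planning effort stays inside the almost deterministic part of the state space.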

pRL can support approaches aiming to solve partially observable Markov decision 
processes (POMDPs). Most practical environments are inevitably POMDPs, which 
appear to the Markovian agent in the form of highly non-deterministic transitions. 
POMDPs are generally solved by extending the state representation of the agent 
(e.g. by estimating the hidden state variables). pRL can be used as a 'pre-filtering'
technique. One can decide to use pRL first and to decouple the almost deterministic
domains of the state space, where, as a first approximation, no further extension
of the representation is necessary.

Another promising area for pRL is seen in the novel E-learning method
[Lőrincz et al., 2002, Szita et al., 2002b, Szita et al., 2002a]. E-learning is a natural
candidate for pRL when trajectory planning is desirable. E-learning is a modification
of traditional reinforcement schemes. In E-learning, the selection of an 'action'
is equivalent to selecting a desired next state [Lőrincz et al., 2002]. After this
selection is made, the desired state is passed to an underlying controller. E-learning
views the controller (the policy, or an inverse dynamics, or both) as part of the
environment. pRL using E-learning develops estimates of state-to-state transition
probabilities. In turn, the set of plannable states is directly available in E-learning.
Thus pRL is greatly simplified within the framework of E-learning.

In our simulations, the discount rate for the planning (γ′) was equal to the
discount rate used in SARSA (γ). A simple but useful extension arises if one allows
for different discount rates. This could be used for a number of purposes. If γ′ > γ
is our choice, then planning may cover larger domains. On the other hand, γ′ < γ
means that planned actions will achieve larger short-term rewards. Thus one can
tune the interaction of planning and reinforcement learning by setting the discount
rates, and determine whether to use planning or direct reinforcement learning to
achieve short-term and long-term goals.
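
The effect of the planning discount rate can be made concrete with a small calculation. The sketch below is an illustration only: the helper `half_weight_distance` is hypothetical, and "the number of steps after which a reward is discounted below half its value" is just one possible proxy for how far planning reaches.

```python
def half_weight_distance(gamma):
    """Smallest number of steps d for which gamma**d drops below 0.5."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    d, weight = 0, 1.0
    while weight >= 0.5:
        weight *= gamma
        d += 1
    return d


print(half_weight_distance(0.98))  # 35 (the SARSA discount factor used here)
print(half_weight_distance(0.99))  # 69 (a larger planning discount rate)
print(half_weight_distance(0.90))  # 7  (a smaller planning discount rate)
```

Under this proxy, raising the planning discount rate from 0.98 to 0.99 roughly doubles the distance over which a reward keeps at least half of its weight, while lowering it to 0.90 concentrates planning on rewards within a few steps.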